
[BugFix] Fix Whitelist optimization CI failure #3290

Merged
hsliuustc0106 merged 27 commits into vllm-project:main from xiaohajiayou:whitelist-optimization-v2 on May 6, 2026

Conversation

@xiaohajiayou
Contributor

Purpose

Reapply the deploy override field derivation change that was reverted by #3287, and explicitly restore the previous deploy behavior for prefix caching.

The previous attempt allowed omitted deploy fields to fall through to vLLM defaults. For enable_prefix_caching, that changed behavior from Omni's previous default False to vLLM's model-dependent fallback. For decoder/generative stages, vLLM usually resolves this to True, which exposed unsupported Omni multi-stage prefix-cache paths and caused L3/L4 CI failures.

This PR keeps the config refactor, but makes the old behavior explicit by setting enable_prefix_caching: false on all deploy stages.
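The fall-through described above can be illustrated with a minimal sketch. The names `ENGINE_DEFAULTS` and `resolve_engine_args` are hypothetical stand-ins, not vLLM-Omni or vLLM APIs, and the engine-side default of `True` for `enable_prefix_caching` is assumed only to mirror the "usually resolves to True" case mentioned in this description.

```python
from typing import Any

# Hypothetical stand-in for the engine's resolved defaults (not real vLLM values).
ENGINE_DEFAULTS: dict[str, Any] = {"enable_prefix_caching": True, "max_num_seqs": 256}


def resolve_engine_args(deploy_overrides: dict[str, Any]) -> dict[str, Any]:
    """Drop overrides left as None, then let the engine defaults fill the gaps."""
    explicit = {k: v for k, v in deploy_overrides.items() if v is not None}
    return {**ENGINE_DEFAULTS, **explicit}


# Field omitted (None): the stage silently inherits the engine default.
omitted = resolve_engine_args({"enable_prefix_caching": None})
# Field set explicitly in the deploy config: the old behavior is preserved.
pinned = resolve_engine_args({"enable_prefix_caching": False})

print(omitted["enable_prefix_caching"], pinned["enable_prefix_caching"])
```

Under these assumptions, only the explicitly pinned stage keeps prefix caching off, which is why the deploy stages set the value rather than relying on an omitted field.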

Changes




Reapply the deploy override field derivation that was reverted in vllm-project#3287 and make prefix-cache behavior explicit in deploy configs. This preserves the config refactor while restoring the previous Omni behavior where deploy stages do not accidentally fall through to vLLM's model-dependent prefix-cache default.

Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 230ece7 to 6478b2c on May 1, 2026 09:39
@xiaohajiayou
Contributor Author

xiaohajiayou commented May 1, 2026

Could you help run the full CI checks to make sure there are no other issues? @lishunyang12 @Gaohan123 @hsliuustc0106

@lishunyang12 added the merge-test label (triggers the Buildkite merge-test CI) on May 2, 2026
Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 8a461ab to da464af on May 2, 2026 08:37
Comment thread tests/e2e/online_serving/test_mimo_audio.py
Comment thread vllm_omni/engine/async_omni_engine.py Outdated
Comment thread vllm_omni/config/stage_config.py
@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 2e08140 to 235a452 on May 2, 2026 14:01
@xiaohajiayou
Contributor Author

Updated according to your comments:

  • skipped the known MiMo CI failure case for now
  • reverted the unnecessary enforce_eager refactor
  • grouped comments for the StageDeployConfig parameters

Could you please take another look when you have time and let me know if there are any remaining issues?

Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from d73a938 to 3675d56 on May 2, 2026 14:07
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

What I validated:

  • DCO, pre-commit, mergeability all passing
  • Fresh unit tests cover the core config nullification behavior (test_default_stage_config_ignores_none_deploy_overrides, test_to_omegaconf_omits_none_deploy_overrides_for_engine_args, test_deploy_override_fields_include_deploy_schema_fields)
  • deploy_override_field_names() unification into stage_config.py is clean and removes the duplicated allowlist from arg_utils.py
  • enable_prefix_caching: false is now explicit on all deploy YAML stages — this is the right fix for the CI regression

What must change before approval:

  1. cosyvoice3.yaml: disable_hybrid_kv_cache_manager: true was silently removed and replaced with enable_prefix_caching: false. These are different settings (hybrid KV cache manager vs prefix caching). If this was intentional, please explain why disabling the hybrid KV cache manager is no longer needed for cosyvoice3. If unintentional, restore it alongside the new enable_prefix_caching line.

  2. Removal of engine_args.setdefault("max_num_seqs", 1) without ensuring all deploy YAMLs set max_num_seqs. Deploy configs like qwen3_omni_moe.yaml (all 3 stages) and cosyvoice3.yaml (both stages) don't set max_num_seqs. With the old setdefault, they got max_num_seqs=1. Now they will fall through to vLLM EngineArgs default of 256. This could change scheduling behavior and memory allocation. Either restore the setdefault or add explicit max_num_seqs values to every stage in every deploy YAML that currently omits it.

  3. Missing test evidence. The PR body's test plan and test results sections are empty. This is a >10 file change that previously caused CI failures (#3287). Please run L3 tests locally and paste the results.

Non-blocking:

  • buildkite CI is still pending — wait for it to complete before merging

Reviewed by Claude Code
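The second blocking item (stages that no longer receive `max_num_seqs=1` from the removed `setdefault`) could be checked mechanically. The following is a minimal sketch that assumes the deploy YAMLs have already been parsed into a mapping of file name to a list of per-stage dicts; the helper `stages_missing_field`, the file names, and that layout are hypothetical, and the real deploy schema may differ.

```python
from typing import Any


def stages_missing_field(
    deploy_configs: dict[str, list[dict[str, Any]]], field: str
) -> list[tuple[str, int]]:
    """Return (config name, stage index) pairs whose stage does not set `field`."""
    missing = []
    for name, stages in deploy_configs.items():
        for index, stage in enumerate(stages):
            # A key that is absent or set to None would fall through to the engine default.
            if stage.get(field) is None:
                missing.append((name, index))
    return missing


# Hypothetical, already-parsed deploy configs used only for illustration.
configs = {
    "example_a.yaml": [{"max_num_seqs": 1}, {"enable_prefix_caching": False}],
    "example_b.yaml": [{"max_num_seqs": 1}],
}
print(stages_missing_field(configs, "max_num_seqs"))
```

A check along these lines could run as a unit test so that a stage omitting the field is reported before it changes scheduling behavior.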

@xiaohajiayou
Contributor Author

xiaohajiayou commented May 2, 2026

Previously, the fix added several missing fields in deploy_config that were exposed during model migration, and set defaults in deploy_config to None, treating the vLLM config as the single source of truth.

However, during CI fixes, two different issues got mixed together:

  1. For defaults removed from deploy_config:
    if a field was not explicitly set in the original YAML, it now needs to be explicitly added back to preserve behavior consistency before and after this PR.

  2. Some migrated YAMLs are missing fields and implicitly rely on defaults.
    In addition, some of these fields may not need to be user-configurable and could be handled in the pipeline (e.g., pipeline.py).

To keep this PR focused and easier to reason about, this PR only addresses the first issue (i.e., preserving previous behavior for migrated YAMLs).
The second issue will be handled in a follow-up, where we can more systematically clean up and define intended configs.

Based on this, this PR updates the migrated deploy YAMLs to explicitly restore only those defaults that differ between old Omni behavior and vLLM defaults, as summarized below:

| Field | Old Omni default | vLLM default / final default | Action |
| --- | --- | --- | --- |
| gpu_memory_utilization | 0.9 | 0.9 | no explicit override needed |
| tensor_parallel_size | 1 | 1 | no explicit override needed |
| enforce_eager | False | False | no explicit override needed |
| data_parallel_size | 1 | 1 | no explicit override needed |
| pipeline_parallel_size | 1 | 1 | no explicit override needed |
| trust_remote_code | True | False | explicitly preserved per stage where needed |
| enable_prefix_caching | False | None (often resolves to True) | explicitly preserved where needed |
| max_num_batched_tokens | 32768 | inferred by vLLM (e.g., 2048 / 8192 / 16384) | explicitly preserved where needed |
| max_num_seqs | 1 | often 256 / 1024 | explicitly preserved where needed |

The goal here is conservative compatibility: keep migrated deploy YAML behavior aligned with pre-refactor Omni defaults, instead of silently falling through to vLLM defaults.
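The selection rule behind the summary (pin only the fields whose old Omni default differs from what vLLM would pick) can be sketched as follows. The values are copied from the summary; `None` is used here as an assumed marker for "resolved by vLLM at runtime", and `fields_to_pin` is a hypothetical helper, not part of this PR.

```python
from typing import Any

# Old Omni defaults, as listed in the summary.
OLD_OMNI_DEFAULTS: dict[str, Any] = {
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "data_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "trust_remote_code": True,
    "enable_prefix_caching": False,
    "max_num_batched_tokens": 32768,
    "max_num_seqs": 1,
}

# vLLM side; None is an assumed marker for values vLLM infers per model or context.
VLLM_DEFAULTS: dict[str, Any] = {
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "data_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "trust_remote_code": False,
    "enable_prefix_caching": None,
    "max_num_batched_tokens": None,
    "max_num_seqs": None,
}


def fields_to_pin(old: dict[str, Any], new: dict[str, Any]) -> list[str]:
    """Return the fields whose old default is not reproduced by the new default."""
    return sorted(key for key, value in old.items() if new.get(key) != value)


print(fields_to_pin(OLD_OMNI_DEFAULTS, VLLM_DEFAULTS))
```

The four fields this returns are the same ones marked "explicitly preserved" in the summary; the five matching fields need no override.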

Follow-up todo issue:
#3313

@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from a3117dd to 074b866 on May 2, 2026 16:25
Signed-off-by: xiaohajiayou <923390377@qq.com>
mm_processor_cache_gb: float | None = None

# Profiling, tokenizer/config parsing, and model-loading behavior.
profiler_config: dict[str, Any] | None = None

Comment thread vllm_omni/config/stage_config.py Outdated
… own group

Move devices and tensor_parallel_size into a dedicated "GPU resources
and parallelism" section, leaving stage_id alone as stage identity.
Change devices default from "0" to None, and tighten the None check in
merge_pipeline_deploy to avoid writing a spurious "devices" key.

Signed-off-by: xiaohajiayou <923390377@qq.com>
@hsliuustc0106
Collaborator

Please update for Ming as well after #3154 is merged.

@xiaohajiayou
Contributor Author

xiaohajiayou commented May 4, 2026

It seems most of the discussion is now around whether these defaults should be fully removed from DeployConfig and instead be explicitly defined in YAMLs.
Previously, there were two main considerations:

  1. Some fields (e.g., trust_remote_code) are not user-configurable, so we considered handling them in pipeline.py.
  2. In [Refactor] Remove redundant StageDeployConfig fields, delegate to vLLM defaults #3128, we also explored removing vLLM-related fields maintained in StageDeployConfig.

Based on these considerations, in this PR I removed the implicit reliance on defaults and explicitly materialized them in the YAMLs.

My plan is to continue the discussion in a follow-up issue (#3313). This way, even if we later remove vLLM fields from StageDeployConfig, we won’t run into issues caused by implicit default drift.

@hsliuustc0106 removed the omni-test label (triggers the Buildkite omni model test in nightly CI) on May 5, 2026
@xiaohajiayou
Contributor Author

xiaohajiayou commented May 5, 2026

Known good commit: 9d9b720. CI was still passing there.

After that, this PR had only one change of its own: ad83e5f, which is mainly YAML field ordering/schema cleanup and does not touch Qwen3-TTS online serving, runtime helpers, or the assertion logic. The rest of the changes were brought in by updating/merging main.

Between 9d9b7209 and the current failing state, the Qwen3-TTS related changes seem to be from main, not this PR:

I also checked the CI results for those individual PRs, and they all passed, which makes this a bit strange.

@amy-why-3459
Contributor

If possible, could you add an omni-test label to check if the changes in this PR have any impact on performance? @gcanlin @lishunyang12

Comment thread vllm_omni/deploy/qwen3_omni_moe.yaml
Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou force-pushed the whitelist-optimization-v2 branch from 31dcc5d to 065034f on May 5, 2026 15:08
@xiaohajiayou
Contributor Author

xiaohajiayou commented May 5, 2026

I ran this Qwen3-TTS CI test locally and it passes on my side. Here are the logs:

  • cmd
cd /root/vllm-omni
MODEL_PREFIX=/root/models /root/vllm-omni/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_qwen3_tts_base.py::test_text_to_audio_001[async_chunk] \
  -m advanced_model -s --run-level advanced_model
  • result
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
============================================ 1 passed, 18 warnings in 392.15s (0:06:32) =============================================
(vllm-omni) root@autodl-container-xs2vhvepls-41228bbf:~/vllm-omni# cd /root/vllm-omni
MODEL_PREFIX=/root/models /root/vllm-omni/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_qwen3_tts_base.py::test_text_to_audio_001[async_chunk] \
  -m advanced_model -s --run-level advanced_model
======================================================== test session starts ========================================================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0
rootdir: /root/vllm-omni
configfile: pyproject.toml
plugins: anyio-4.13.0, mock-3.15.1
collecting ... INFO 05-05 23:24:00 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
WARNING 05-05 23:24:00 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:00 [nixl_utils.py:44] NIXL agent config is not available
collected 1 item                                                                                                                    

tests/e2e/online_serving/test_qwen3_tts_base.py INFO 05-05 23:24:00 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-05 23:24:00 [vllm.py:840] Asynchronous scheduling is enabled.
INFO 05-05 23:24:00 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
Path load_format does not exist
Path load_format does not exist
Pre-test GPU status:
[GPU Memory Monitor] Waiting for GPU 0 to free memory, Condition: Memory usage ratio ≤ 5.0%
[GPU Memory Status] Current usage:
  GPU 0: 0.5GiB/32.0GiB (1.6%)
[GPU Memory Freed] Devices 0 meet memory condition
   Condition: Memory usage ratio ≤ 5.0%
   Wait time: 0.0 seconds (0.0 minutes)
Post-test GPU status:

================================================================================
NVIDIA GPU Information (nvidia-smi)
================================================================================
Tue May  5 23:24:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:4F:00.0 Off |                  N/A |
| 30%   32C    P8             13W /  320W |       1MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

================================================================================
Detailed GPU Processes (nvidia-smi pmon)
================================================================================
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              


================================================================================
System Processes with GPU keywords
================================================================================
Launching OmniServer with: /root/vllm-omni/.venv/bin/python -m vllm_omni.entrypoints.cli.main serve /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base --host 127.0.0.1 --port 34085 --omni --trust-remote-code --disable-log-stats --stage-init-timeout 600 --init-timeout 900 --stage-configs-path /tmp/qwen3_tts_op0ng0ik.yaml
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:10 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:10 [nixl_utils.py:44] NIXL agent config is not available
INFO 05-05 23:24:12 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 05-05 23:24:12 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 05-05 23:24:12 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 05-05 23:24:12 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 05-05 23:24:12 [logo.py:45] 
(APIServer pid=9684) INFO 05-05 23:24:12 [utils.py:299] vLLM server version 0.20.0, serving model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:24:13 [utils.py:233] non-default args: {'model_tag': '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', 'host': '127.0.0.1', 'port': 34085, 'model': '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', 'tokenizer_mode': None, 'trust_remote_code': True, 'dtype': None, 'enforce_eager': None, 'config_format': None, 'load_format': None, 'pipeline_parallel_size': None, 'tensor_parallel_size': None, 'data_parallel_size': None, 'gpu_memory_utilization': None, 'mm_processor_cache_gb': None, 'skip_mm_profiling': None, 'compilation_config': None, 'profiler_config': None, 'disable_log_stats': True}
(APIServer pid=9684) INFO 05-05 23:24:13 [omni_base.py:153] [AsyncOmni] Initializing with model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:290] [AsyncOmniEngine] Initializing with model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) WARNING 05-05 23:24:13 [async_omni_engine.py:1418] stage_configs_path is set — the following top-level engine args are ignored (per-stage YAML takes precedence): attention_config, disable_log_stats, eplb_config, ir_op_priority, kernel_config, reasoning_parser_plugin, structured_outputs_config, trust_remote_code
(APIServer pid=9684) WARNING 05-05 23:24:13 [utils.py:191] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:350] [AsyncOmniEngine] Launching Orchestrator thread with 2 stages
(APIServer pid=9684) INFO 05-05 23:24:13 [initialization.py:351] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:767] [AsyncOmniEngine] Initializing stage 0
(APIServer pid=9684) INFO 05-05 23:24:13 [stage_init_utils.py:386] [stage_init] Stage-0 set runtime devices: 0
(APIServer pid=9684) INFO 05-05 23:24:13 [async_omni_engine.py:767] [AsyncOmniEngine] Initializing stage 1
(APIServer pid=9684) WARNING 05-05 23:24:13 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_CLEAN_GPU_MEMORY
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:13 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [model.py:555] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
(APIServer pid=9684) INFO 05-05 23:24:23 [model.py:1680] Using max model len 4096
(APIServer pid=9684) INFO 05-05 23:24:23 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=9684) INFO 05-05 23:24:23 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=9684) INFO 05-05 23:24:23 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=9684) INFO 05-05 23:24:23 [async_omni_engine.py:467] [AsyncOmniEngine] Stage 0 engine launch started
(APIServer pid=9684) INFO 05-05 23:24:23 [stage_init_utils.py:386] [stage_init] Stage-1 set runtime devices: 0
(APIServer pid=9684) WARNING 05-05 23:24:23 [envs.py:1818] Unknown vLLM environment variable detected: VLLM_TEST_CLEAN_GPU_MEMORY
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:24:23 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:32 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:32 [nixl_utils.py:44] NIXL agent config is not available
(APIServer pid=9684) INFO 05-05 23:24:34 [model.py:555] Resolved architecture: Qwen3TTSCode2Wav
(APIServer pid=9684) INFO 05-05 23:24:34 [model.py:1680] Using max model len 65536
(APIServer pid=9684) INFO 05-05 23:24:34 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=65536.
(APIServer pid=9684) WARNING 05-05 23:24:34 [vllm.py:896] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=9684) WARNING 05-05 23:24:34 [vllm.py:914] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=9684) INFO 05-05 23:24:34 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'])
(APIServer pid=9684) INFO 05-05 23:24:34 [vllm.py:1089] Cudagraph is disabled under eager mode
(APIServer pid=9684) INFO 05-05 23:24:34 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:37609 backend=nccl
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:34 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:35 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:35 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:35 [gpu_model_runner.py:4777] Starting to load model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base...
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [flash_attn.py:646] Using FlashAttention version 2
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [voice_cache.py:43] Voice embedding cache initialized (max_entries=128)
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [weight_utils.py:904] Filesystem type for checkpoints: OVERLAY. Checkpoint size: 3.59 GiB. Available RAM: 450.39 GiB.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:36 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (OVERLAY) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.41it/s]
(StageEngineCoreProc pid=9834) 
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:37 [qwen3_tts_talker.py:1656] Loaded 396 weights for Qwen3TTSTalkerForConditionalGeneration
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:37 [default_loader.py:384] Loading weights took 0.90 seconds
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [gpu_model_runner.py:4879] Model loading took 3.63 GiB memory and 1.898549 seconds
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:38 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/5edd4a18a8/rank_0_0/backbone for vLLM's torch.compile
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:42 [backends.py:1128] Dynamo bytecode transform time: 4.06 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [backends.py:290] Directly load the compiled graph(s) for compile range (1, 512) from the cache, took 1.249 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [decorators.py:305] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/3e289124d765cb14db476cf476e22bf6f15ab2acf33192b13414f2f7efd02f88/rank_0_0/model
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [monitor.py:53] torch.compile took 6.05 s in total
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:44 [monitor.py:81] Initial profiling/warmup run took 0.17 s
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [base.py:163] Available KV cache memory: 5.72 GiB (profiling fallback)
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [kv_cache_utils.py:1711] GPU KV cache size: 53,504 tokens
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:45 [kv_cache_utils.py:1716] Maximum concurrency for 4,096 tokens per request: 13.06x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████| 5/5 [00:00<00:00, 21.85it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 26.35it/s]
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:46 [gpu_model_runner.py:6133] Graph capturing finished in 1 secs, took 0.06 GiB
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:46 [core.py:299] init engine (profile, create kv cache, warmup model) took 8.40 s (compilation: 6.05 s)
(StageEngineCoreProc pid=9834) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(StageEngineCoreProc pid=9834) WARNING 05-05 23:24:47 [scheduler.py:181] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARAsyncScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [factory.py:46] Created connector: SharedMemoryConnector
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [vllm.py:840] Asynchronous scheduling is enabled.
(APIServer pid=9684) INFO 05-05 23:24:47 [async_omni_engine.py:484] [AsyncOmniEngine] Stage 0 engine startup completed
(StageEngineCoreProc pid=9834) INFO 05-05 23:24:47 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=9684) INFO 05-05 23:24:47 [async_omni_engine.py:467] [AsyncOmniEngine] Stage 1 engine launch started
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:24:56 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:24:56 [nixl_utils.py:44] NIXL agent config is not available
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', speculative_config=None, tokenizer='/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [65536], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto')
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:57047 backend=nccl
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(StageEngineCoreProc pid=10102) WARNING 05-05 23:24:59 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:24:59 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:59 [gpu_model_runner.py:4777] Starting to load model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base...
(StageEngineCoreProc pid=10102) INFO 05-05 23:24:59 [default_loader.py:384] Loading weights took 1749513.05 seconds
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [gpu_model_runner.py:4879] Model loading took 0.0 GiB memory and 0.002176 seconds
(StageEngineCoreProc pid=10102) `torch_dtype` is deprecated! Use `dtype` instead!
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [modeling_qwen3_tts_tokenizer_v2.py:969] Precomputed exp caches for 29 SnakeBeta activations
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:00 [cuda_graph_decoder_wrapper.py:105] Starting CUDA Graph warmup for 11 sizes: [2, 4, 8, 16, 25, 32, 64, 97, 128, 256, 325]
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=2
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=4
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=8
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:02 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=16
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=25
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=32
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=64
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=97
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=128
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=256
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:118]   Captured CUDA Graph for size=325
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [cuda_graph_decoder_wrapper.py:123] CUDA Graph warmup complete: 11/11 captured
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [modeling_qwen3_tts_tokenizer_v2.py:999] CUDA Graph enabled for decoder: seq_lens=[2, 4, 8, 16, 25, 32, 64, 97, 128, 256, 325]
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [qwen3_tts_code2wav.py:158] Code2Wav decoder CUDA Graph enabled
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:03 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:03 [gpu_generation_model_runner.py:472] Dummy sampler run is not implemented for generation model
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:03 [core.py:306] init engine (profile, create kv cache, warmup model) took 3.34 s
(StageEngineCoreProc pid=10102) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [scheduler.py:181] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [core.py:138] Disabling chunked prefill for model without KVCache
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [factory.py:46] Created connector: SharedMemoryConnector
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [vllm.py:840] Asynchronous scheduling is enabled.
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [vllm.py:896] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:04 [vllm.py:914] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'])
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [vllm.py:1089] Cudagraph is disabled under eager mode
(APIServer pid=9684) INFO 05-05 23:25:04 [async_omni_engine.py:484] [AsyncOmniEngine] Stage 1 engine startup completed
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:04 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:134] [StageEngineCoreClient] Stage-0 initializing EngineCore
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:134] [StageEngineCoreClient] Stage-1 initializing EngineCore
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:174] [StageEngineCoreClient] Stage-1 EngineCore running
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:04 [stage_engine_core_client.py:174] [StageEngineCoreClient] Stage-0 EngineCore running
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:487] talker_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:490] speaker_encoder_config is None. Initializing talker model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:05 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:702] [AsyncOmniEngine] Stage 1 initialized
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:702] [AsyncOmniEngine] Stage 0 initialized
(APIServer pid=9684) INFO 05-05 23:25:05 [orchestrator.py:192] [Orchestrator] Starting event loop
(APIServer pid=9684) INFO 05-05 23:25:05 [async_omni_engine.py:378] [AsyncOmniEngine] Orchestrator ready with 2 stages
(APIServer pid=9684) INFO 05-05 23:25:05 [omni_base.py:166] [AsyncOmni] AsyncOmniEngine initialized in 52.99 seconds
(APIServer pid=9684) INFO 05-05 23:25:05 [omni_base.py:185] [AsyncOmni] Initialized with 2 stages for model /root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base
(APIServer pid=9684) INFO 05-05 23:25:06 [api_server.py:651] Supported tasks: {'generate', 'speech'}
(APIServer pid=9684) WARNING 05-05 23:25:06 [model.py:1437] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.9, 'max_tokens': 8192}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:06 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=9684) WARNING 05-05 23:25:06 [serving_speech.py:401] No speakers found in config (checked spk_id and speaker_id)
(APIServer pid=9684) WARNING 05-05 23:25:06 [serving_speech.py:234] Uploaded voices are ephemeral and will be lost on server restart. Re-upload voices after each restart if needed.
(APIServer pid=9684) INFO 05-05 23:25:06 [serving_speech.py:242] Loaded 0 supported speakers: []
(APIServer pid=9684) INFO 05-05 23:25:06 [serving_speech.py:293] Loaded codec frame rate: 12.5 Hz (output_sample_rate=24000, encode_downsample_rate=1920)
(APIServer pid=9684) INFO 05-05 23:25:06 [serving.py:45] OpenAIServingRealtime initialized for task: realtime
(APIServer pid=9684) INFO 05-05 23:25:06 [api_server.py:424] Starting vLLM API server 0 on http://127.0.0.1:34085
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:37] Available routes are:
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/speech, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/speech/batch, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/generate, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/audio/voices/{name}, Methods: DELETE
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/images/generations, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/images/edits, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/sync, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}, Methods: DELETE
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/videos/{video_id}/content, Methods: GET
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/omni/sleep, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:46] Route: /v1/omni/wakeup, Methods: POST
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/video/chat/stream, Endpoint: streaming_video_chat
(APIServer pid=9684) INFO 05-05 23:25:06 [launcher.py:57] Route: /v1/realtime, Endpoint: realtime_websocket
(APIServer pid=9684) INFO:     Started server process [9684]
(APIServer pid=9684) INFO:     Waiting for application startup.
(APIServer pid=9684) INFO:     Application startup complete.
Server ready on 127.0.0.1:34085
OmniServer started successfully

=== PRE-TEST GPU CLEANUP ===

Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

--- Running test: test_text_to_audio_001[async_chunk]
(APIServer pid=9684) The tokenizer you are loading from '/root/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-870395dad3e04f08: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-870395dad3e04f08 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-ad4b71432ea76f84: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-870395dad3e04f08
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-98c95225cab8a7a1: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-a4550679581b1dc5: text='The weather is nice today, perfect for a walk in t...', model=Base
(APIServer pid=9684) INFO 05-05 23:25:08 [serving_speech.py:1822] TTS speech request speech-ac4cfcbae885fe1d: text='The weather is nice today, perfect for a walk in t...', model=Base
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:08 [gpu_model_runner.py:390] additional_information on request data is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:08 [gpu_model_runner.py:1145] additional_information on scheduled_cached_reqs is deprecated, use model_intermediate_buffer
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-870395dad3e04f08
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-ad4b71432ea76f84 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-ad4b71432ea76f84
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-ad4b71432ea76f84
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-98c95225cab8a7a1 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-98c95225cab8a7a1
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-98c95225cab8a7a1
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-a4550679581b1dc5 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-a4550679581b1dc5
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-a4550679581b1dc5
(APIServer pid=9684) INFO 05-05 23:25:08 [orchestrator.py:901] [Orchestrator] _handle_add_request: stage=0 req=speech-ac4cfcbae885fe1d prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
(APIServer pid=9684) INFO 05-05 23:25:08 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-0 adding request: speech-ac4cfcbae885fe1d
(APIServer pid=9684) INFO 05-05 23:25:09 [stage_engine_core_client.py:230] [StageEngineCoreClient] Stage-1 adding request: speech-ac4cfcbae885fe1d
(StageEngineCoreProc pid=9834) `torch_dtype` is deprecated! Use `dtype` instead!
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:09 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:09 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(StageEngineCoreProc pid=9834) WARNING 05-05 23:25:09 [gpu_model_runner.py:1497] _merge_additional_information_update is deprecated, use _update_intermediate_buffer
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:568] code_predictor: warmup done for buckets [1, 2, 4, 8, 10]
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:588] code_predictor: captured CUDA graphs for buckets [1, 2, 4, 8, 10]
(StageEngineCoreProc pid=9834) INFO 05-05 23:25:14 [qwen3_code_predictor.py:536] code_predictor: torch.compile (no epilogue fusion) + CUDA graphs
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:15 [gpu_model_runner.py:390] additional_information on request data is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:15 [gpu_model_runner.py:1145] additional_information on scheduled_cached_reqs is deprecated, use model_intermediate_buffer
(StageEngineCoreProc pid=10102) INFO 05-05 23:25:15 [qwen3_tts_code2wav.py:288] Code2Wav codec: frames=103 q=16 uniq=1072 range=[1,2047] batch=1
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:18 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35138 - "POST /v1/audio/speech HTTP/1.1" 200 OK
audio data is saved: ./test_b3faaa6eeb58497ab9ca371ca23b00d4.wav
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:18 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35140 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:21 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35156 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:21 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35162 - "POST /v1/audio/speech HTTP/1.1" 200 OK
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
(StageEngineCoreProc pid=10102) WARNING 05-05 23:25:22 [qwen3_tts_code2wav.py:260] Code2Wav input_ids length 1 not divisible by num_quantizers 16; skipping malformed request.
(APIServer pid=9684) INFO:     127.0.0.1:35170 - "POST /v1/audio/speech HTTP/1.1" 200 OK
WARNING 05-05 23:25:27 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:25:27 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.661387036787346
audio content is: The weather is nice today. Perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
config.json: 2.65kB [00:00, 7.32MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████| 378M/378M [03:23<00:00, 1.86MB/s]
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████| 215/215 [00:00<00:00, 1.41MB/s]
Device set to use cpu
gender classifier: label=женский, conf=0.984, gender=female, median_f0=240.6Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_b25c3b3d7f754866a4d97ca7e77a35ab.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:12 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:12 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 17.73810344398953
audio content is: The weather is nice today, perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.986, gender=female, median_f0=250.0Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_5060caf337fa40ccbade0dcc9f0cd82a.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:29 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:29 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.998421740951017
audio content is: The weather is nice today, perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.990, gender=female, median_f0=262.3Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_40c98cf4facc46d6b10cfa93c31a0cad.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:29:46 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:29:46 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 16.983736565103754
audio content is: The weather is nice today perfect for a walk in the park
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.978, gender=female, median_f0=254.0Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
audio data is saved: ./test_a939a4ef782c429f9e2b5f8c4c514a93.wav
/root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
 --> vLLM version 0.20.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
WARNING 05-05 23:30:04 [nixl_utils.py:34] NIXL is not available
WARNING 05-05 23:30:04 [nixl_utils.py:44] NIXL agent config is not available
the avg e2e latency is: 17.315035382052884
audio content is: The weather is nice today. Perfect for a walk in the park.
input text is: The weather is nice today, perfect for a walk in the park.
cosine similarity text1 is: the weather is nice today perfect for a walk in the park, text2 is: the weather is nice today perfect for a walk in the park
Cosine similarity: 1.000
gender classifier: label=женский, conf=0.976, gender=female, median_f0=223.8Hz
Preset voice gender check: preset='clone', estimated='female', expected='female'
.
Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

OmniServer stopping...
(StageEngineCoreProc pid=10102) INFO 05-05 23:30:12 [core.py:1238] Shutdown initiated (timeout=0)
(StageEngineCoreProc pid=9834) INFO 05-05 23:30:12 [core.py:1238] Shutdown initiated (timeout=0)
(StageEngineCoreProc pid=10102) INFO 05-05 23:30:12 [core.py:1261] Shutdown complete
(StageEngineCoreProc pid=9834) INFO 05-05 23:30:12 [core.py:1261] Shutdown complete
[rank0]:[W505 23:30:13.533936353 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W505 23:30:13.815573753 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=9684) ERROR 05-05 23:30:14 [stage_engine_core_client.py:201] [StageEngineCoreClient] Stage-1 subprocess died unexpectedly (exit code None).
(APIServer pid=9684) ERROR 05-05 23:30:14 [stage_engine_core_client.py:201] [StageEngineCoreClient] Stage-0 subprocess died unexpectedly (exit code None).
(APIServer pid=9684) INFO 05-05 23:30:22 [omni_base.py:456] [AsyncOmni] Shutting down
(APIServer pid=9684) INFO 05-05 23:30:22 [async_omni_engine.py:1785] [AsyncOmniEngine] Shutting down Orchestrator
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:252] [Orchestrator] Received shutdown signal
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1221] [Orchestrator] Shutting down all stages
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1225] [Orchestrator] Stage 0 shut down
(APIServer pid=9684) INFO:     Shutting down
(APIServer pid=9684) INFO 05-05 23:30:22 [orchestrator.py:1225] [Orchestrator] Stage 1 shut down
(APIServer pid=9684) INFO 05-05 23:30:22 [launcher.py:137] Shutting down FastAPI HTTP server.
(APIServer pid=9684) INFO 05-05 23:30:22 [omni_base.py:456] [AsyncOmni] Shutting down
(APIServer pid=9684) INFO:     Shutting down
(APIServer pid=9684) INFO:     Waiting for application shutdown.
(APIServer pid=9684) INFO:     Application shutdown complete.
Pre-test GPU status:
[GPU Memory Monitor] Waiting for GPU 0 to free memory, Condition: Memory usage ratio ≤ 5.0%
[GPU Memory Status] Current usage:
  GPU 0: 0.5GiB/32.0GiB (1.6%)
[GPU Memory Freed] Devices 0 meet memory condition
   Condition: Memory usage ratio ≤ 5.0%
   Wait time: 0.0 seconds (0.0 minutes)
Post-test GPU status:

================================================================================
NVIDIA GPU Information (nvidia-smi)
================================================================================
Tue May  5 23:30:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:4F:00.0 Off |                  N/A |
| 30%   33C    P8             13W /  320W |       1MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

================================================================================
Detailed GPU Processes (nvidia-smi pmon)
================================================================================
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              


================================================================================
System Processes with GPU keywords
================================================================================
OmniServer stopped


========================================================= warnings summary ==========================================================
.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434
  /root/vllm-omni/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

vllm_omni/version.py:55
  /root/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.19.0rc2.dev202+g8e64c7ef2
   --> vLLM version 0.20.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: 14 warnings
  /root/vllm-omni/.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
============================================ 1 passed, 18 warnings in 392.15s (0:06:32) =============================================
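
For readers of the log above: the `cosine similarity text1 is: ... text2 is: ...` lines suggest that the test lowercases both the transcribed audio and the input text and strips punctuation before comparing them. The snippet below is a minimal sketch of that kind of check. The function names `normalize` and `cosine_similarity` are hypothetical, and the bag-of-words vectorization is an assumption; the log does not show how the real test computes its vectors.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces (assumed from the logged text1/text2)."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity of two strings after normalization.

    Assumption: word-count vectors; the actual test may use a different vectorizer.
    """
    va = Counter(normalize(a).split())
    vb = Counter(normalize(b).split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    transcript = "The weather is nice today. Perfect for a walk in the park."
    reference = "The weather is nice today, perfect for a walk in the park."
    print(f"Cosine similarity: {cosine_similarity(transcript, reference):.3f}")
```

Under these assumptions, transcripts that differ from the input only in punctuation or casing (as in the runs above) normalize to the same string and therefore score the maximum similarity.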

@lishunyang12 added the omni-test label (label to trigger buildkite omni model test in nightly CI) on May 5, 2026
@linyueqian
Collaborator

please fix ci

@xiaohajiayou
Contributor Author

xiaohajiayou commented May 6, 2026

please fix ci

@linyueqian
Collaborator

@yenuo26 @hsliuustc0106 i think this pr is good to merge. please take another look.

@linyueqian added the tts-test (label to trigger buildkite tts models test in nightly CI) and ready (label to trigger buildkite CI) labels on May 6, 2026
@gcanlin
Collaborator

gcanlin commented May 6, 2026

@amy-why-3459 https://buildkite.com/vllm/vllm-omni/builds/8939/canvas?sid=019df8f9-159d-43d0-8b59-530fe5f1dcdc&tab=output, please check the performance in this build. It does not appear to show a regression.

@hsliuustc0106
Collaborator

@yenuo26 @hsliuustc0106 i think this pr is good to merge. please take another look.

I will merge it after CI

@amy-why-3459
Contributor

@amy-why-3459 https://buildkite.com/vllm/vllm-omni/builds/8939/canvas?sid=019df8f9-159d-43d0-8b59-530fe5f1dcdc&tab=output, please check the performance in this build. It does not appear to show a regression.

LGTM

@hsliuustc0106 hsliuustc0106 merged commit b076006 into vllm-project:main May 6, 2026
8 checks passed
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Labels

merge-test: label to trigger buildkite merge test CI
omni-test: label to trigger buildkite omni model test in nightly CI
ready: label to trigger buildkite CI
tts-test: label to trigger buildkite tts models test in nightly CI

Development

Successfully merging this pull request may close these issues.

[Bug]: Voxtral-4B-TTS-2603 fails to start unless --skip-mm-profiling is explicitly set

8 participants